Advanced Data Visualization¶
In the last module we started building some visualizations in order to answer specific substantive questions. In this module we will look at generating more advanced graphics that have annotations, combine several graphics into one, build some interactive graphics, and also do some mapping.
Combining Plots with {patchwork}¶
Often you have to combine and place multiple graphics into a single canvas. There are a few ways to do this but the easiest way is that offered by the {patchwork} package. Let us use the diamonds data for this section, a data frame with 53940 rows and 10 variables:
Variable |
Description |
|---|---|
price |
price in US dollars ($326–$18,823) |
carat |
weight of the diamond (0.2–5.01) |
cut |
quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color |
diamond colour, from D (best) to J (worst) |
clarity |
a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x |
length in mm (0–10.74) |
y |
width in mm (0–58.9) |
z |
depth in mm (0–31.8) |
depth |
total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
table |
width of top of diamond relative to widest point (43–95) |
The Basics¶
To combine multiple plots, we need to save each plot with a unique name. I am calling them p1, p2, etc. Let us generate four plots, each different from all others.
# install.packages(c("patchwork", "tidylog", "leaflet", "htmltools", "highcharter"))
library(patchwork)
library(tidyverse)
library(tidylog)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.6 ✔ dplyr 1.0.7
✔ tidyr 1.1.4 ✔ stringr 1.4.0
✔ readr 2.1.1 ✔ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Attaching package: ‘tidylog’
The following objects are masked from ‘package:dplyr’:
add_count, add_tally, anti_join, count, distinct, distinct_all,
distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
full_join, group_by, group_by_all, group_by_at, group_by_if,
inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
relocate, rename, rename_all, rename_at, rename_if, rename_with,
right_join, sample_frac, sample_n, select, select_all, select_at,
select_if, semi_join, slice, slice_head, slice_max, slice_min,
slice_sample, slice_tail, summarise, summarise_all, summarise_at,
summarise_if, summarize, summarize_all, summarize_at, summarize_if,
tally, top_frac, top_n, transmute, transmute_all, transmute_at,
transmute_if, ungroup
The following objects are masked from ‘package:tidyr’:
drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
spread, uncount
The following object is masked from ‘package:stats’:
filter
options(repr.plot.width = 16, repr.plot.height = 8)
ggplot() +
geom_bar(data = diamonds,
aes(x = cut, fill = cut)
) +
labs(x = "Cut of the Diamond", y = "Frequency") +
theme(legend.position = "none") -> p1
ggplot() +
geom_bar(data = diamonds,
aes(x = color, fill = color)
) +
labs(x = "Color of the Diamond", y = "Frequency") +
theme(legend.position = "none") -> p2
ggplot() +
geom_point(data = diamonds,
aes(x = carat, y = price, color = cut)
) +
labs(x = "Weight of the Diamond", y = "Price of the Diamond", color = "") +
theme(legend.position = "bottom") -> p3
ggplot() +
geom_boxplot(data = diamonds,
aes(x = price, y = clarity, fill = cut)
) +
labs(y = "Clarity of the Diamond", x = "Price of the Diamond", fill = "") +
theme(legend.position = "bottom") -> p4
Let us see each plot in turn so we know what they look like.
p1; p2; p3; p4
Now we combine p1 through p3 on a single canvas
p1 + p2 + p3
Notice the default layout here: p1 + p2 + p3 gives us the plots all in a row.
But you may have other plans, for example, to put the scatterplot in a row all its own.
(p1 + p2) / p3
Now we have p3 in the second row, all by itself. Note that this was achieved via the / operator and by coercing p1 and p2 into a single row via (p1 + p2).
What if we used | instead?
p1 | (p2 + p3)
You ended up with two columns, the first containing only p1 and the second containing p2 and p3.
Note the difference between | and /. For example, note the following setup:
p1 | (p2 / p3)
You get p1 in one-half of the canvas, and then p2 and p3 are split into two rows in the remaining half of the canvas.
What if we wanted to squeeze in the fourth plot?
options(repr.plot.width = 16, repr.plot.height = 16)
(p1 + p2) / (p3 + p4)
Here you asked for p1 and p2 to be kept together, which led to both occupying the first row of the canvas. Then p3 and p4 were slotted into the second row.
Annotations¶
Annotations become helpful because you can add omnibus titles and tags for individual plots. For example, you can generate a common title, subtitle, caption, etc as shown below).
(p1 + p2) / (p3 + p4) +
plot_annotation(
title = 'The surprising truth about diamonds',
subtitle = 'These plots will reveal untold secrets about one of our beloved data-sets',
caption = 'Disclaimer: None of these plots are insightful',
tag_levels = c('a', '1'),
tag_prefix = 'Fig. ',
tag_sep = '.',
tag_suffix = ':'
) &
theme(
plot.tag.position = c(0, 1),
plot.tag = element_text(size = 9, hjust = 0, vjust = 0, color = "steelblue")
)
Spacing and Sizing¶
We can also tweak the sizes of individual rows and columns, control the space between plots, and so on. First up, spacing the plots with plot_spacer()
options(repr.plot.width = 16, repr.plot.height = 8)
(p1 + plot_spacer() + p2 + plot_spacer() + p3)
Sizing the plots with relative sizes?
options(repr.plot.width = 16, repr.plot.height = 16)
p1 + p2 + p3 + p4 +
plot_layout(widths = c(2, 1))
Alternatively, we could specify size with unit vectors, as shown below.
p1 + p2 + p3 + p4 +
plot_layout(
widths = c(2, 1),
heights = unit(c(5, 1), c('cm', 'null'))
)
Moving Beyond the grid¶
We can use a layout design to get a little more flexibility but still retain full control over the result. Layout designs can be done in two ways so let us see the easiest route – as a text setup. “When using the textual representation it is your responsibility to make sure that each area is rectangular. The only exception is # which denotes empty areas and can thus be of any shape.”
layout <- "
##BBBB
AACCDD
##CCDD
"
p2 + p3 + p4 + p1 +
plot_layout(design = layout)
The other path is using area() inside layout, as shown below.
layout <- c(
area(t = 2, l = 1, b = 5, r = 4),
area(t = 1, l = 3, b = 3, r = 5)
)
p3 + p4 +
plot_layout(design = layout)
Watch the specification here with wrap_plots()
layout <- '
A##
#B#
##C
'
wrap_plots(A = p1, B = p2, C = p3, design = layout)
Fixed-aspect plots¶
There are some plots that use fixed coordinates and these should not be disturbed. Here is an example where the map has fixed coordinates specified via coord_fixed(1.3)
library(urbnmapr)
ggplot() +
geom_polygon(
data = states,
aes(x = long, y = lat, group = group, fill = state_abbv)
) +
coord_fixed(1.3) +
ggthemes::theme_map() +
theme(legend.position = "none") +
labs(title = "Fixed!!") -> mymap
mymap + p1 + p2 + p3
Mapping¶
Maps are very powerful visualizations because they allow you to highlight and reflect patterns, clusters, with relative ease. For example, is poverty really higher in Appalachian counties? What about the percent of the population without health insurance? Literacy? Opioid deaths; do they follow transportation routes? What about COVID-19 cases? Maps to the rescue!
Building a map requires a few elements. First and foremost, you need some data to show on a map. Second, you need to have the geographic coordinates needed to build a map, basically the latitude and longitude of the geographies (states, cities, school districts, etc.) that you want to map. Third, you want a column that contains the names of the geographies you want to map, and these should be properly formatted (i.e., in titlecase) for displaying on the map.
Let us start by building a simple state map with the {urbnmapr} package. It comes with the necessary data for states and counties, respectively, and works well with {ggplot2}. Note the reliance on geom_polygon() now.
If you get an error message about urbnmapr not found, go ahead and install it ONCE via devtools::install_github("UrbanInstitute/urbnmapr")
head(states)
| long | lat | order | hole | piece | group | state_fips | state_abbv | state_name |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <lgl> | <fct> | <fct> | <chr> | <chr> | <chr> |
| -88.47323 | 31.89386 | 1 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
| -88.46888 | 31.93026 | 2 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
| -88.46866 | 31.93317 | 3 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
| -88.45504 | 32.03972 | 4 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
| -88.45496 | 32.04058 | 5 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
| -88.45342 | 32.05305 | 6 | FALSE | 1 | 01.1 | 01 | AL | Alabama |
head(counties)
| long | lat | order | hole | piece | group | county_fips | state_abbv | state_fips | county_name | fips_class | state_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <lgl> | <fct> | <fct> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| -86.91760 | 32.66417 | 1 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.81657 | 32.66012 | 2 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71339 | 32.66173 | 3 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71422 | 32.70569 | 4 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41312 | 32.70739 | 5 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41117 | 32.40994 | 6 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
ggplot() +
geom_polygon(
data = states,
aes(
x = long,
y = lat,
group = group),
fill = "white",
color = "steelblue"
) +
coord_fixed(1.3)
Note the use of geom_polygon() and that longitude goes on the x-axis and latitude on the y-axis. I also specified the fill color and the border (via color = "")
Note that this is just an empty map with the shapes of the states, and also that Alaska and Hawaii have been moved so that they can be displayed on the map.
We could build a much better map by removing the x and y axis labels and tick marks, and setting a white background using theme_map() from the {ggthemes} package. We could also fill with some colors, say on the basis of the state_name.
Again, if you get an error with ggthemes, go ahead and install it ONCE via install.packages("ggthemes")
ggplot() +
geom_polygon(
data = states,
aes(
x = long,
y = lat,
group = group,
fill = state_name),
color = "white"
) +
coord_fixed(1.3) +
ggthemes::theme_map() +
theme(legend.position = "none")
Note the legend has been hidden here since the legend with 50 states would take up all the space on the plotting canvas!
More importantly, this is not a very useful map because it would be much better to color the map on the basis of some substantive variable such as population density, income, crime rates, health insurance coverage rates, vaccination rates, population size, and so on. But we need data for each state to fold this information in.
Let us see what information lurks in the statedata file.
head(statedata)
| year | state_fips | state_name | hhpop | horate | medhhincome |
|---|---|---|---|---|---|
| <int> | <chr> | <chr> | <int> | <dbl> | <int> |
| 2015 | 01 | Alabama | 1846380 | 0.6814329 | 44700 |
| 2015 | 02 | Alaska | 250183 | 0.6311860 | 70600 |
| 2015 | 04 | Arizona | 2463012 | 0.6206178 | 51000 |
| 2015 | 05 | Arkansas | 1144657 | 0.6546031 | 42000 |
| 2015 | 06 | California | 12895471 | 0.5372219 | 64600 |
| 2015 | 08 | Colorado | 2074517 | 0.6389425 | 63500 |
Okay, two things stand out – horate (the homeownership rate), and medhhincome (the median household income). Let us fill with median household income but to do so, we will need to join statedata to our states file. Why? Because we need coordinates to map anything and statedata does not contain coordinates.
states %>%
left_join(
statedata,
by = c("state_fips", "state_name")
) -> state.df
left_join: added 4 columns (year, hhpop, horate, medhhincome)
> rows only in x 0
> rows only in y ( 0)
> matched rows 83,933
> ========
> rows total 83,933
head(state.df)
| long | lat | order | hole | piece | group | state_fips | state_abbv | state_name | year | hhpop | horate | medhhincome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <lgl> | <fct> | <fct> | <chr> | <chr> | <chr> | <int> | <int> | <dbl> | <int> |
| -88.47323 | 31.89386 | 1 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
| -88.46888 | 31.93026 | 2 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
| -88.46866 | 31.93317 | 3 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
| -88.45504 | 32.03972 | 4 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
| -88.45496 | 32.04058 | 5 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
| -88.45342 | 32.05305 | 6 | FALSE | 1 | 01.1 | 01 | AL | Alabama | 2015 | 1846380 | 0.6814329 | 44700 |
Now we can build the map with state.df and specify fill = medhhincome inside the aes(...) command.
options(repr.plot.width = 24, repr.plot.height = 16)
ggplot() +
geom_polygon(
data = state.df,
aes(
x = long,
y = lat,
group = group,
fill = medhhincome
),
color = "white"
) +
coord_fixed(1.3) +
ggthemes::theme_map() +
labs(
title = "Median Household Income in the States (2015)",
fill = "Median Household Income"
) +
scale_fill_viridis_c(option = "magma") +
theme(
legend.position = "bottom",
legend.text = element_text(size = 14),
legend.key.width = unit(5, 'cm'),
title = element_text(size = 20, face = "bold")
)
Notice the fill color and the corresponding legend rely upon values of medhhincome, allowing a reader to get some feel for how the low versus high median household income states cluster spatially. In this case, Maryland appears to have the highest median household income, and thn some of the New England states. West Virginia, Mississippi, and Arkansas bring up the rear.
What about working with counties instead of states? Sure, let us merge countydata with the counties file and then draw the map.
counties %>%
left_join(
countydata,
by = c("county_fips")
) -> county.df
left_join: added 4 columns (year, hhpop, horate, medhhincome)
> rows only in x 0
> rows only in y ( 0)
> matched rows 208,874
> =========
> rows total 208,874
ggplot() +
geom_polygon(data = county.df,
aes(x = long, y = lat, group = group, fill = medhhincome),
color = "white", size = 0.05) +
coord_fixed(1.3) +
ggthemes::theme_map() +
theme(
legend.position = "bottom",
legend.text = element_text(size = 14),
legend.key.width = unit(5, 'cm'),
title = element_text(size = 20, face = "bold")
) +
labs(
title = "Median Household Income in the Counties (2015)",
fill = "Median Household Income"
) +
scale_fill_viridis_c(option = "magma")
Maybe you are only interested in Florida?
county.df %>%
filter(state_abbv == "FL") %>%
ggplot() +
geom_polygon(
aes(x = long, y = lat, group = group,
fill = medhhincome),
color = "white", size = 0.05
) +
coord_fixed(1.3) +
ggthemes::theme_map() +
theme(
legend.position = "bottom",
legend.text = element_text(size = 14),
legend.key.width = unit(5, 'cm'),
title = element_text(size = 20, face = "bold")
) +
labs(
title = "Median Household Income in Floria Counties (2015)",
fill = "Median Household Income"
) +
scale_fill_viridis_c(option = "plasma")
filter: removed 203,791 rows (98%), 5,083 rows remaining
Hmm, so far so good but what if the data were for some geography not bundled with {urbnmapr}, school districts or places, for example? Not a problem, we just have to go the extra mile. First we would have to find, download, and upload the shapefile. Say I am looking for places (loosely described as municipalities) in New Hampshire. Well, the {tigris} package comes in handy because it allows you to get whatever geography’s shapefiles you want. Below I am getting the shapefile for New Hampshire.
Again, if you get an error with tigris, install it ONCE via install.packages("tigris")
library(tigris)
options(tigris_use_cache = TRUE)
places(
state = "NH", cb = TRUE, year = 2018, progress_bar = FALSE
) -> places.nh
To enable
caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
Here, we asked for all places in New Hampsire via places(state = "NH", ...) and asked for the map boundaries to be those on file for 2018. Boundary files are updated every year or two, depending upon changes and state reporting.
places.nh %>%
head()
Registered S3 method overwritten by 'geojsonsf':
method from
print.geojson geojson
| STATEFP | PLACEFP | PLACENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <POLYGON [°]> | |
| 1 | 33 | 61860 | 02378088 | 1600000US3361860 | 3361860 | Pittsfield | 57 | 4114545 | 0 | POLYGON ((-71.34923 43.2990... |
| 2 | 33 | 67460 | 02629738 | 1600000US3367460 | 3367460 | Sanbornville | 57 | 4106641 | 3626 | POLYGON ((-71.04123 43.5649... |
| 3 | 33 | 87140 | 02378099 | 1600000US3387140 | 3387140 | Woodsville | 57 | 2271155 | 63295 | POLYGON ((-72.04107 44.1542... |
| 4 | 33 | 12900 | 00873567 | 1600000US3312900 | 3312900 | Claremont | 25 | 111766137 | 2323426 | POLYGON ((-72.41538 43.3802... |
| 5 | 33 | 11300 | 02378055 | 1600000US3311300 | 3311300 | Charlestown | 57 | 2107485 | 121062 | POLYGON ((-72.43447 43.2304... |
| 6 | 33 | 78340 | 02629744 | 1600000US3378340 | 3378340 | Walpole | 57 | 3152287 | 27471 | POLYGON ((-72.43535 43.0751... |
What you have is a shapefile … a document that gives you the coordinates stored in POLYGON(...) and other information such as the unique FIPS codes, the placename, land area, water area, and so on.
Okay, so now that I have the shapefile, how can I use it?
I need to fortify it so that it looks like a regular dataframe rather than the native SpatialPolygonsDataFrame format it comes in. When I go to make the map I am going to add the state shapefile too since otherwise the state’s boundary will not show up.
places.nh %>%
fortify(region = "GEOID") -> nh.df
names(nh.df)
- 'STATEFP'
- 'PLACEFP'
- 'PLACENS'
- 'AFFGEOID'
- 'GEOID'
- 'NAME'
- 'LSAD'
- 'ALAND'
- 'AWATER'
- 'geometry'
nh.df %>%
head()
| STATEFP | PLACEFP | PLACENS | AFFGEOID | GEOID | NAME | LSAD | ALAND | AWATER | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <POLYGON [°]> | |
| 1 | 33 | 61860 | 02378088 | 1600000US3361860 | 3361860 | Pittsfield | 57 | 4114545 | 0 | POLYGON ((-71.34923 43.2990... |
| 2 | 33 | 67460 | 02629738 | 1600000US3367460 | 3367460 | Sanbornville | 57 | 4106641 | 3626 | POLYGON ((-71.04123 43.5649... |
| 3 | 33 | 87140 | 02378099 | 1600000US3387140 | 3387140 | Woodsville | 57 | 2271155 | 63295 | POLYGON ((-72.04107 44.1542... |
| 4 | 33 | 12900 | 00873567 | 1600000US3312900 | 3312900 | Claremont | 25 | 111766137 | 2323426 | POLYGON ((-72.41538 43.3802... |
| 5 | 33 | 11300 | 02378055 | 1600000US3311300 | 3311300 | Charlestown | 57 | 2107485 | 121062 | POLYGON ((-72.43447 43.2304... |
| 6 | 33 | 78340 | 02629744 | 1600000US3378340 | 3378340 | Walpole | 57 | 3152287 | 27471 | POLYGON ((-72.43535 43.0751... |
In the map, instead of filter(...) I am using another command to only keep boundaries for places in New Hampshire. This is being done via data = subset(state.df, state_name == "New Hampshire")
ggplot() +
geom_polygon(
data = subset(state.df, state_name == "New Hampshire"),
aes(x = long, y = lat, group = group),
fill = "white", color = "black"
) +
geom_sf(
data = nh.df,
aes(fill = GEOID)
) +
ggthemes::theme_map() +
theme(legend.position = "none")
Of course, the fill is superficial here. But say we had some data for places in New Hampshire, maybe the size of the population, as in nh.data.RData. Now we could join nh.data with nh.df to create nh and then map. Note the join keys … GEOID in each file.
load("data/nh.data.RData")
head(nh.data)
| GEOID | NAME | population |
|---|---|---|
| <chr> | <chr> | <dbl> |
| 3300980 | Alton CDP, New Hampshire | 168 |
| 3301220 | Amherst CDP, New Hampshire | 709 |
| 3301620 | Antrim CDP, New Hampshire | 1232 |
| 3301940 | Ashland CDP, New Hampshire | 1353 |
| 3303620 | Bartlett CDP, New Hampshire | 104 |
| 3304660 | Belmont CDP, New Hampshire | 1814 |
nh.df %>%
left_join(
nh.data,
by = c("GEOID" = "GEOID")
) -> nh
left_join: added 3 columns (NAME.x, NAME.y, population)
> rows only in x 0
> rows only in y ( 0)
> matched rows 97
> ====
> rows total 97
head(nh)
| STATEFP | PLACEFP | PLACENS | AFFGEOID | GEOID | NAME.x | LSAD | ALAND | AWATER | NAME.y | population | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <dbl> | <POLYGON [°]> | |
| 1 | 33 | 61860 | 02378088 | 1600000US3361860 | 3361860 | Pittsfield | 57 | 4114545 | 0 | Pittsfield CDP, New Hampshire | 1586 | POLYGON ((-71.34923 43.2990... |
| 2 | 33 | 67460 | 02629738 | 1600000US3367460 | 3367460 | Sanbornville | 57 | 4106641 | 3626 | Sanbornville CDP, New Hampshire | 581 | POLYGON ((-71.04123 43.5649... |
| 3 | 33 | 87140 | 02378099 | 1600000US3387140 | 3387140 | Woodsville | 57 | 2271155 | 63295 | Woodsville CDP, New Hampshire | 903 | POLYGON ((-72.04107 44.1542... |
| 4 | 33 | 12900 | 00873567 | 1600000US3312900 | 3312900 | Claremont | 25 | 111766137 | 2323426 | Claremont city, New Hampshire | 13016 | POLYGON ((-72.41538 43.3802... |
| 5 | 33 | 11300 | 02378055 | 1600000US3311300 | 3311300 | Charlestown | 57 | 2107485 | 121062 | Charlestown CDP, New Hampshire | 1029 | POLYGON ((-72.43447 43.2304... |
| 6 | 33 | 78340 | 02629744 | 1600000US3378340 | 3378340 | Walpole | 57 | 3152287 | 27471 | Walpole CDP, New Hampshire | 519 | POLYGON ((-72.43535 43.0751... |
Now we plot:
ggplot() +
geom_polygon(
data = subset(
state.df,
state_name == "New Hampshire"
),
aes(
x = long,
y = lat,
group = group
),
fill = "white",
color = "black"
) +
geom_sf(
data = nh,
aes(fill = population)
) +
coord_sf() +
scale_fill_viridis_c(option = "viridis") +
ggthemes::theme_map() +
theme(
legend.position = "bottom",
legend.text = element_text(size = 14),
legend.key.width = unit(5, 'cm'),
title = element_text(size = 20, face = "bold")
) +
labs(
fill = "Population Size",
title = "Population Distribution in New Hampshire's Places",
subtitle = "(American Community Survey, 2014-2018)"
)
You could have also filled by creating quartiles, etc., using {Santoku}, so do not forget that option.
Before we move on, one more map, to show you the possibilities. Here, I am
plotting the locations of the parking tickets issued in Philadelphia, this time with the {leaflet} package. First the tickets data-set, reduced to a random sample of 20% of the tickets issued in the month of December in 2017 (to keep the data size manageable). I have also created a popup that will display specific information if someone clicks on a point in the map.
Again, install any missing packages you are alerted about but remember that you need to do this ONCE, not every time you need to use the package.
readr::read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-03/tickets.csv"
) %>%
mutate(
year = lubridate::year(issue_datetime),
month = lubridate::month(issue_datetime)
) %>%
filter(month == 12, lon > -75.5) %>%
sample_frac(0.2) -> tickets
Rows: 1260891 Columns: 7
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): violation_desc, issuing_agency
dbl (4): fine, lat, lon, zip_code
dttm (1): issue_datetime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mutate: new variable 'year' (double) with one unique value and 0% NA
new variable 'month' (double) with 12 unique values and 0% NA
filter: removed 1,169,690 rows (93%), 91,201 rows remaining
sample_frac: removed 72,961 rows (80%), 18,240 rows remaining
tickets %>%
unite(
display,
c(issuing_agency, issue_datetime, fine),
sep = "; ",
remove = FALSE
) -> tickets
library(leaflet)
library(htmltools)
library(widgetframe)
Loading required package: htmlwidgets
lst <- list()
leaflet(tickets) %>%
addTiles() %>%
addCircles(
lng = ~ lon,
lat = ~ lat,
popup = ~htmlEscape(display),
color = "steelblue",
opacity = 0.10
) -> leaf01
leaf01 -> lst
htmltools::tagList(lst)
Voila! A few lines of code and we have an interactive map that can be used to display whatever evidence we want to display. Note that you need geographic coordinates since without them the data cannot be attached to a physical location.
Let us see another twist on this. Say I am trying to map the total number of COVID-19 cases in Ohio. I know the county and I know the number of cases that occurred, as well as the latitude and longitude of each county. Well, I can build a similar plot except making the size of the circle conditional upon the number of cases. The more the cases, the larger the radius of the circle.
readr::read_csv(
"https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
) -> covid
Rows: 2057110 Columns: 6
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): county, state, fips
dbl (2): cases, deaths
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid %>%
filter(
state == "Ohio", date == "2020-04-17"
) -> cov19
filter: removed 2,057,023 rows (>99%), 87 rows remaining
But there are no coordinates in cov19. Well, I can get those from the {housingData} package, and then merge that with cov19.
library(housingData)
geoCounty %>%
filter(state == "OH") %>%
separate(
county,
into = c("countyname", "extra"),
sep = " County",
remove = TRUE
) %>%
mutate(
countyname = stringr::str_to_sentence(countyname)
) -> oh
filter: removed 2,987 rows (97%), 88 rows remaining
mutate: changed one value (1%) of 'countyname' (0 new NA)
oh %>%
left_join(
cov19,
by = c("countyname" = "county")
) -> ohcov19
head(ohcov19)
left_join: added 7 columns (fips.x, state.x, date, state.y, fips.y, …)
> rows only in x 2
> rows only in y ( 1)
> matched rows 86
> ====
> rows total 88
| fips.x | countyname | extra | state.x | lon | lat | rMapState | rMapCounty | date | state.y | fips.y | cases | deaths | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <chr> | <chr> | <fct> | <dbl> | <dbl> | <fct> | <fct> | <date> | <chr> | <chr> | <dbl> | <dbl> | |
| 1 | 39001 | Adams | OH | -83.46359 | 38.85662 | ohio | adams | 2020-04-17 | Ohio | 39001 | 3 | 0 | |
| 2 | 39003 | Allen | OH | -84.10825 | 40.77675 | ohio | allen | 2020-04-17 | Ohio | 39003 | 65 | 9 | |
| 3 | 39005 | Ashland | OH | -82.26922 | 40.86122 | ohio | ashland | 2020-04-17 | Ohio | 39005 | 5 | 0 | |
| 4 | 39007 | Ashtabula | OH | -80.75647 | 41.71017 | ohio | ashtabula | 2020-04-17 | Ohio | 39007 | 54 | 4 | |
| 5 | 39009 | Athens | OH | -82.04053 | 39.34576 | ohio | athens | 2020-04-17 | Ohio | 39009 | 3 | 1 | |
| 6 | 39011 | Auglaize | OH | -84.22192 | 40.56421 | ohio | auglaize | 2020-04-17 | Ohio | 39011 | 21 | 1 |
Now I want to create a new column that shows the county name and the number of cases.
ohcov19 %>%
unite(
display,
c(countyname, cases),
sep = ": ",
remove = FALSE
) -> ohcov19
head(ohcov19)
| fips.x | display | countyname | extra | state.x | lon | lat | rMapState | rMapCounty | date | state.y | fips.y | cases | deaths | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <chr> | <chr> | <chr> | <fct> | <dbl> | <dbl> | <fct> | <fct> | <date> | <chr> | <chr> | <dbl> | <dbl> | |
| 1 | 39001 | Adams: 3 | Adams | OH | -83.46359 | 38.85662 | ohio | adams | 2020-04-17 | Ohio | 39001 | 3 | 0 | |
| 2 | 39003 | Allen: 65 | Allen | OH | -84.10825 | 40.77675 | ohio | allen | 2020-04-17 | Ohio | 39003 | 65 | 9 | |
| 3 | 39005 | Ashland: 5 | Ashland | OH | -82.26922 | 40.86122 | ohio | ashland | 2020-04-17 | Ohio | 39005 | 5 | 0 | |
| 4 | 39007 | Ashtabula: 54 | Ashtabula | OH | -80.75647 | 41.71017 | ohio | ashtabula | 2020-04-17 | Ohio | 39007 | 54 | 4 | |
| 5 | 39009 | Athens: 3 | Athens | OH | -82.04053 | 39.34576 | ohio | athens | 2020-04-17 | Ohio | 39009 | 3 | 1 | |
| 6 | 39011 | Auglaize: 21 | Auglaize | OH | -84.22192 | 40.56421 | ohio | auglaize | 2020-04-17 | Ohio | 39011 | 21 | 1 |
Okay, now we have everything we need to build the map.
lst <- list()
leaflet(ohcov19) %>%
addTiles() %>%
addCircleMarkers(
lng = ~ lon,
lat = ~ lat,
popup = ~htmlEscape(display),
color = "salmon",
opacity = 0.10,
radius = ~sqrt(cases)
) -> leaf02
leaf02 -> lst
htmltools::tagList(lst)
Note that the radius of the circles is being driven by the square-root of the number of cases (a convenient transformation that makes the size managable). If we relied on the cases themselves, with no square-root, we would get this, useless map because the large number of cases would overwhelm everything else, coloring everything shades of red.
lst <- list()
leaflet(ohcov19) %>%
addTiles() %>%
addCircleMarkers(
lng = ~ lon,
lat = ~ lat,
popup = ~htmlEscape(display),
color = "salmon",
opacity = 0.10,
radius = ~cases
) -> leaf02bad
leaf02bad -> lst
htmltools::tagList(lst)
Interactive Graphics with Plotly and Highcharter¶
Interactive graphics are useful in situations where you would like the user/viewer to see the data values or other details by hovering over or clicking on the graphic. Say, for example, I have a scatterplot and want to make it interactive. How can I do that?
One crude and fast way to do that is by saving my ggplot2 object and then using {plotly} to add a ggplotly() wrapper around the plot.
In the example below I am saving the plot as pl01, then wrapping it in ggplotly with ggplotly(pl01) -> lst.
Note too that the lst <- list() and htmltools::tagList(lst) commands will show up for the interactive plots, and they need to. Otherwise the plot may not render correctly in the notebook.
library(plotly)
lst <- list()
ggplot() +
geom_point(
data = mpg,
mapping = aes(
x = cty,
y = hwy,
color = trans)
) +
labs(
x = "City Mileage",
y = "Highway Mileage",
color = "Transmission"
) -> pl01
ggplotly(pl01) -> lst
htmltools::tagList(lst)
Attaching package: ‘plotly’
The following objects are masked from ‘package:tidylog’:
distinct, filter, group_by, mutate, rename, select, slice,
summarise, transmute, ungroup
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
These plots are useful when presenting data to a live audience (in a talk, or on the web).
Rather than use plotly, I prefer {highcharter} since it does a lot of things well with minimal fuss, and yet the resulting plots are aesthetically pleasing.
Let us stay with the COVID-19 example. Say I want a bar-chart of the total number of cases by state and want to do this via highcharter.
library(highcharter)
covid %>%
filter(
date == "2020-04-17"
) %>%
rename(
State = state,
`Total Cases` = cases
) -> tab1
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
lst <- list()
hchart(
tab1,
"bar",
hcaes(
x = State,
y = `Total Cases`
)
) -> hc01
hc01 -> lst
htmltools::tagList(lst)
Notice the key elements here: The basic function call is hchart() and we are specifying that we want a bar-chart, and we are also providing the quantities that should go on the x and y axis, respectively. Note that x actually ends up as the y when you specify a “bar” chart.
What if I wanted a line-chart, maybe of the number of cases over time? And I wanted this just for a few states? We could do that too, as shown below. Note that I am creating tab2, a frequency table of the number of cases by state and date, and then converting total_cases into a logarithmic form (saved as log_cases) so that we can compare the rate of change from one date to the next on a common scale.
covid %>%
filter(
state %in% c("Ohio", "Florida", "California", "New Jersey", "Ohio", "New York"),
date >= "2020-03-01"
) %>%
group_by(state, date) %>%
mutate(
log_cases = log(sum(cases))
) %>%
ungroup() -> tab2
head(tab2)
| date | county | state | fips | cases | deaths | log_cases |
|---|---|---|---|---|---|---|
| <date> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| 2020-03-01 | Alameda | California | 06001 | 1 | 0 | 3.496508 |
| 2020-03-01 | Humboldt | California | 06023 | 1 | 0 | 3.496508 |
| 2020-03-01 | Los Angeles | California | 06037 | 1 | 0 | 3.496508 |
| 2020-03-01 | Marin | California | 06041 | 1 | 0 | 3.496508 |
| 2020-03-01 | Napa | California | 06055 | 1 | 0 | 3.496508 |
| 2020-03-01 | Orange | California | 06059 | 1 | 0 | 3.496508 |
There are duplicate rows per state per date because the counties remain. I will run distinct() to get rid of them.
tab2 %>%
select(state, date, log_cases) %>%
distinct() -> tab2_nodups
head(tab2_nodups)
| state | date | log_cases |
|---|---|---|
| <chr> | <date> | <dbl> |
| California | 2020-03-01 | 3.4965076 |
| Florida | 2020-03-01 | 0.6931472 |
| New York | 2020-03-01 | 0.0000000 |
| California | 2020-03-02 | 3.6375862 |
| Florida | 2020-03-02 | 0.6931472 |
| New York | 2020-03-02 | 0.0000000 |
lst <- list()
hchart(
tab2_nodups,
"line",
hcaes(
x = date,
y = log_cases,
group = state
)
) -> hc02
hc02 -> lst
htmltools::tagList(lst)
Now here is a county-level chart that shows the total number of cases as of November 15, 2021.
The data are stored in tab3 created as shown below. Pay attention to this creation because we are not just creating a frequency table but also adding in a specific key we are calling code because we will need to join these data to the map data.
covid %>%
group_by(county, state, fips) %>%
filter(date == "2021-11-15") %>%
unite(
Location,
c(county, state),
sep = ", ",
remove = TRUE
) -> tab3
head(tab3)
| date | Location | fips | cases | deaths |
|---|---|---|---|---|
| <date> | <chr> | <chr> | <dbl> | <dbl> |
| 2021-11-15 | Autauga, Alabama | 01001 | 10407 | 154 |
| 2021-11-15 | Baldwin, Alabama | 01003 | 37875 | 581 |
| 2021-11-15 | Barbour, Alabama | 01005 | 3648 | 79 |
| 2021-11-15 | Bibb, Alabama | 01007 | 4317 | 92 |
| 2021-11-15 | Blount, Alabama | 01009 | 10536 | 188 |
| 2021-11-15 | Bullock, Alabama | 01011 | 1523 | 44 |
The next step will be to take the counties data that has the longitude\latitude, and join it to the tab3 data. But before we can do that we will need to split the county_fips variable into two – stfips and fips. Why? Because we need to create a key that can be used to join these data to the highcharter map data. In highcharter there is a variable called code that looks like “us-al-001” for Autauga county in Alabama, and so on. Such a variable does not exist in the counties data-set.
library(urbnmapr)
data(counties)
head(counties)
| long | lat | order | hole | piece | group | county_fips | state_abbv | state_fips | county_name | fips_class | state_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <lgl> | <fct> | <fct> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| -86.91760 | 32.66417 | 1 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.81657 | 32.66012 | 2 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71339 | 32.66173 | 3 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71422 | 32.70569 | 4 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41312 | 32.70739 | 5 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41117 | 32.40994 | 6 | FALSE | 1 | 01001.1 | 01001 | AL | 01 | Autauga County | H1 | Alabama |
counties %>%
separate(
county_fips,
into = c("stfips", "fips"),
sep = 2,
remove = FALSE
) %>%
mutate(
leader = "us",
stlower = stringr::str_to_lower(state_abbv)
) %>%
unite(
code,
c(leader, stlower, fips),
sep = "-"
) -> cdf
head(cdf)
| long | lat | order | hole | piece | group | county_fips | stfips | code | state_abbv | state_fips | county_name | fips_class | state_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <int> | <lgl> | <fct> | <fct> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| -86.91760 | 32.66417 | 1 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.81657 | 32.66012 | 2 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71339 | 32.66173 | 3 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.71422 | 32.70569 | 4 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41312 | 32.70739 | 5 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
| -86.41117 | 32.40994 | 6 | FALSE | 1 | 01001.1 | 01001 | 01 | us-al-001 | AL | 01 | Autauga County | H1 | Alabama |
cdf %>%
select(code, county_fips) %>%
distinct() -> cdf2
head(cdf2)
| code | county_fips |
|---|---|
| <chr> | <chr> |
| us-al-001 | 01001 |
| us-al-003 | 01003 |
| us-al-005 | 01005 |
| us-al-007 | 01007 |
| us-al-009 | 01009 |
| us-al-011 | 01011 |
Now we can join cdf2 to tab3 so that the code variable will become common to the resulting data-set and highcharter!
tab3 %>%
left_join(
cdf2,
by = c("fips" = "county_fips")
) -> tab4
left_join: added one column (code)
> rows only in x 117
> rows only in y ( 9)
> matched rows 3,133
> =======
> rows total 3,250
head(tab4)
| date | Location | fips | cases | deaths | code |
|---|---|---|---|---|---|
| <date> | <chr> | <chr> | <dbl> | <dbl> | <chr> |
| 2021-11-15 | Autauga, Alabama | 01001 | 10407 | 154 | us-al-001 |
| 2021-11-15 | Baldwin, Alabama | 01003 | 37875 | 581 | us-al-003 |
| 2021-11-15 | Barbour, Alabama | 01005 | 3648 | 79 | us-al-005 |
| 2021-11-15 | Bibb, Alabama | 01007 | 4317 | 92 | us-al-007 |
| 2021-11-15 | Blount, Alabama | 01009 | 10536 | 188 | us-al-009 |
| 2021-11-15 | Bullock, Alabama | 01011 | 1523 | 44 | us-al-011 |
Here comes the map!
Note that we are asking forthe cases column to be used for the values that will color each county, and we are asking the map file be joined with hc-key in the highcharter file and code in tab4. The county birders will be in steelblue, and there will be 10 values used to create the fill color palette. The legend will be aligned right, and horizontal.
library(viridis)
lst <- list()
hcmap("countries/us/us-all-all",
data = tab4,
name = "COVID-19 Cases", value = "cases",
joinBy = c("hc-key", "code"),
borderColor = "steelblue") %>%
hc_colorAxis(stops = color_stops(10, rev(magma(10)))) %>%
hc_legend(layout = "horizontal", align = "right",
floating = TRUE, valueDecimals = 0, valueSuffix = ""
) -> hc03
hc03 -> lst
htmltools::tagList(lst)
Loading required package: viridisLite
Note that countries/us/us-all-all indicates that we want counties. If we wanted the states instead it would have been countries/us/us-all.
What if we wanted only Ohio?
Well, in that case we could subset as shown below. In particular, we are asking that if R sees the string “oh” in a variable called code, it should keep only these rows, and save these filtered rows of data as tab5.
tab4 %>%
filter(
grepl("-oh-", code)
) -> tab5
head(tab5)
| date | Location | fips | cases | deaths | code |
|---|---|---|---|---|---|
| <date> | <chr> | <chr> | <dbl> | <dbl> | <chr> |
| 2021-11-15 | Adams, Ohio | 39001 | 4360 | 105 | us-oh-001 |
| 2021-11-15 | Allen, Ohio | 39003 | 17443 | 312 | us-oh-003 |
| 2021-11-15 | Ashland, Ohio | 39005 | 7413 | 146 | us-oh-005 |
| 2021-11-15 | Ashtabula, Ohio | 39007 | 11166 | 217 | us-oh-007 |
| 2021-11-15 | Athens, Ohio | 39009 | 8083 | 93 | us-oh-009 |
| 2021-11-15 | Auglaize, Ohio | 39011 | 7334 | 110 | us-oh-011 |
lst <- list()
hcmap("countries/us/us-oh-all",
data = tab5,
name = "COVID-19 Cases", value = "cases",
joinBy = c("hc-key", "code"),
borderColor = "steelblue") %>%
hc_colorAxis(stops = color_stops(10, rev(magma(10)))) %>%
hc_legend(layout = "horizontal", align = "right",
floating = TRUE, valueDecimals = 0, valueSuffix = ""
) -> hc04
hc04 -> lst
htmltools::tagList(lst)
There you have it!
The one downside to these interactive charts is that they are best displayed in html files but in PDF and Word document they lose that interactivity. Hence you see them a lot on blogs and other web-based documents.
All of these packages have been growing so it is quite likely that as software development continues even that barrier might be eliminated.
Exercises for Practice¶
Exercise 01¶
Create a map of all the counties in New York. Be sure to title the map and to fill in each county with the total number of COVID19 cases they have seen to date. In addition, draw county borders in white. Use theme_map() and make sure the legend is at the bottom. [Hint: You will need to calculate the total number of cases per county and then join the resulting file with the counties data file to get the latitude/longitudes for the counties.]
Exercise 02¶
Run the following code chunk to load data on the murder, assault and rape rates per 100,000 persons. Urbanpop is the percent of the state population that lives in an urban area.
library(tidyverse)
data(USArrests)
names(USArrests)
USArrests$statename <- rownames(USArrests)
head(USArrests)
- 'Murder'
- 'Assault'
- 'UrbanPop'
- 'Rape'
| Murder | Assault | UrbanPop | Rape | statename | |
|---|---|---|---|---|---|
| <dbl> | <int> | <int> | <dbl> | <chr> | |
| Alabama | 13.2 | 236 | 58 | 21.2 | Alabama |
| Alaska | 10.0 | 263 | 48 | 44.5 | Alaska |
| Arizona | 8.1 | 294 | 80 | 31.0 | Arizona |
| Arkansas | 8.8 | 190 | 50 | 19.5 | Arkansas |
| California | 9.0 | 276 | 91 | 40.6 | California |
| Colorado | 7.9 | 204 | 78 | 38.7 | Colorado |
Now create a state-level map of the 50 states making sure to use UrbanPop to fill each state. Title the map and place the legend at the bottom.
Exercise 03¶
Use the USArrests data to draw scatterplots of (a) Murder versus UrbanPop, (b) Assault versus UrbanPop, and © Rape versus UrbanPop. Save each of these scatterplots by name and then use patchwork to create a single canvas that includes all three plots. Make sure you label the x-axis, y-axis, and title each plot.
Exercise 04¶
Now create highcharter versions of each of the three scatterplots you created in Exercise (3) above. You should end up with three scatterplots, each on its own canvas.